A Discrete Time Markov Chain Model for High 
Throughput Bidirectional Fano Decoders 



Ran Xu*, Graeme Woodward^, Kevin Morris* and Taskin Kocak* 



*Centre for Communications Research, Department of Electrical and Electronic Engineering 

University of Bristol, Bristol, UK 
^Telecommunications Research Laboratory (TRL), Toshiba Research Europe Limited, 32 Queen Square, Bristol, UK 



Abstract — The bidirectional Fano algorithm (BFA) can achieve 
at least twice the decoding throughput of the conventional 
unidirectional Fano algorithm (UFA). In this paper, 
bidirectional Fano decoding is examined from the queuing theory 
perspective. A Discrete Time Markov Chain (DTMC) is employed 
to model the BFA decoder with a finite input buffer. The 
relationship between the input data rate, the input buffer size and 
the clock speed of the BFA decoder is established. The DTMC 
based modelling can be used in designing a high throughput 
parallel BFA decoding system. It is shown that there is a trade-off 
between the number of BFA decoders and the input buffer 
size, and that an optimal input buffer size can be chosen to 
minimize the hardware complexity for a target decoding throughput. 

Index Terms — Bidirectional Fano algorithm, high throughput 
decoding, queuing theory, sequential decoding. 

I. Introduction 

Sequential decoding is one method for decoding convolutional 
codes [1]. Unlike the well-known Viterbi algorithm, sequential 
decoding has a computational effort that adapts to the 
signal-to-noise ratio (SNR). When the SNR 
is relatively high, the computational complexity of sequential 
decoding is much lower than that of Viterbi decoding. Addi- 
tionally, sequential decoding can decode very long constraint 
length convolutional codes since its computational effort is 
independent of the constraint length. Thus, a long constraint 
length convolutional code can be used to achieve a better error 
rate performance. The two main sequential decoding algorithms 
are the Stack algorithm [2] and the Fano algorithm [3]. The Fano 
algorithm is more suitable for hardware implementation since, 
unlike the Stack algorithm, it does not require extensive sorting 
operations or large memory [4], [5]. 

High throughput decoding is of research interest due to 
the increasing data rate requirement. The baseband signal 
processing is becoming more and more power and area hungry. 
For example, to achieve the required high throughput, the 
WirelessHD specification proposes simultaneous transmission 
of eight interleaved codewords, each encoded by a convo- 
lutional code [6]. It is straightforward to use eight parallel 
Viterbi decoders to achieve multi-Gbps decoding throughput. 
Since sequential decoding has lower hardware complexity and 
lower power consumption than Viterbi decoding [4], [5], we are 
motivated to consider sequential decoding for high throughput 
applications when the SNR is relatively high. In a practical 
implementation of a sequential decoder, an input buffer is 
required because the computational effort varies from codeword 
to codeword. The contribution of this work is twofold. First, the 
bidirectional Fano decoder with an input buffer is modelled by a 
Discrete Time Markov Chain (DTMC), which establishes the 
relationship between the input data rate, the input buffer size 
and the clock speed of the BFA decoder. Second, the trade-off 
between the number of BFA decoders and the input buffer size in 
designing a high throughput parallel BFA decoding system is 
presented. 

The rest of the paper is organized as follows. In Section II, 
the bidirectional Fano algorithm is reviewed and the system 
model is given. The BFA decoder with an input buffer is 
analyzed by queuing theory in Section III, and the simulation 
results are presented in Section IV. Section V is about choosing 
the optimal input buffer size in designing a parallel BFA 
decoding system, and the conclusions are drawn in Section 
VI. 

II. System Model for BFA Decoder 

A. Bidirectional Fano Algorithm 

In the conventional unidirectional Fano algorithm (UFA), 
the decoder starts decoding from state zero. During each 
iteration of the algorithm, the current state may move forward, 
move backward, or stay at the current state. The decision is 
made based on the comparison between the threshold value 
and the path metric. If a forward movement is made, the 
threshold value needs to be tightened. If the current state 
cannot move forward or backward, the threshold value needs 
to be loosened. A detailed flowchart of the Fano algorithm can 
be found in [1]. In [7], a bidirectional Fano algorithm (BFA) 
was proposed, in which there is a forward decoder (FD) and a 
backward decoder (BD) working in parallel. The FD and the BD 
decode the same codeword simultaneously, starting from the start 
state and the end state respectively and moving in opposite 
directions. Decoding terminates when the FD and the BD merge 
with each other or when either reaches the other end of the code tree. Compared 
to the conventional UFA, the BFA can achieve a much higher 
decoding throughput due to the reduction in computational 
effort and the parallel processing of the two decoders. A 
detailed discussion on the BFA can be found in [7]. 









Fig. 1. System model for BFA decoder with overflow notification from the 
input buffer 



B. System Model 

Since the computational effort of sequential decoding is 
variable, an input buffer is used to accommodate the code- 
words to be decoded. The system model for a BFA decoder 
with an input buffer is shown in Fig. 1. It is assumed that 
a continuous data stream with a raw data rate of $R_d$ bps is 
input to the buffer. The length of the input buffer is $B$, which 
means that it can accommodate up to $B$ codewords in addition 
to the one the decoder is working on. The clock frequency of the 
BFA decoder is $f_{clk}$ Hz, and it is assumed that the BFA decoder 
can execute one iteration per clock cycle. In the BFA decoding, 
the number of clock cycles to decode one codeword follows 
the Pareto distribution, and the Pareto exponent is a function 
of the SNR and the code rate. A higher SNR or a lower code 
rate results in a higher Pareto exponent [7]. As shown in Fig. 1, 
there is an overflow notification from the input buffer to the 
BFA decoder. The occupancy of the input buffer is observed 
and the currently decoded codeword will be erased if the input 
buffer gets full. As a result, the total number of codewords 
consists of the following: 



$$N_{total} = N_{decoded} + N_{erased}. \qquad (1)$$



In order to evaluate how the performance of a BFA decoder is affected 
by the parameters $R_d$, $f_{clk}$ and $B$, a metric called the 
failure probability ($P_f$) is defined as follows: 

$$P_f = \frac{N_{erased}}{N_{total}} = \frac{N_{erased}}{N_{decoded} + N_{erased}}, \qquad (2)$$

where $P_f$ is similar to the frame error rate ($P_F$), which is 
caused by decoding errors. The total frame error rate is: 

$$P_t = P_f + P_F. \qquad (3)$$

In designing the system, $R_d$, $f_{clk}$ and $B$ need to be chosen 
properly to ensure that: 

$$P_f \ll P_F. \qquad (4)$$



In this paper, $P_f = 0.01 \times P_F$ is adopted as the target failure 
probability ($P_{target}$). How to choose $R_d$, $f_{clk}$ and $B$ so that 
a BFA decoder achieves $P_{target}$ is discussed next. 
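
The bookkeeping behind Eqs. (1)-(3) and the target above can be sketched in a few lines of Python. This is only an illustration: the function names and the codeword counts are assumptions made for this sketch, and the value $P_F = 0.1$ is chosen so that the target is $10^{-3}$, the order of magnitude quoted later for $E_b/N_0$ = 4 dB.

```python
def failure_probability(n_decoded: int, n_erased: int) -> float:
    """Eq. (2): fraction of codewords erased because the input buffer was full."""
    n_total = n_decoded + n_erased          # Eq. (1)
    if n_total == 0:
        raise ValueError("no codewords observed")
    return n_erased / n_total


def total_frame_error_rate(p_f: float, p_F: float) -> float:
    """Eq. (3): erasures and decoding errors both count as frame errors."""
    return p_f + p_F


def target_failure_probability(p_F: float, ratio: float = 0.01) -> float:
    """Target adopted in the text: P_f = 0.01 x P_F."""
    return ratio * p_F


# Hypothetical counts, chosen only to exercise the formulas.
p_f = failure_probability(n_decoded=99_990, n_erased=10)
p_target = target_failure_probability(p_F=0.1)
meets_target = p_f <= p_target
```

With these made-up counts the erasure rate is one order of magnitude below the target, so the buffer overflow contributes little to the total frame error rate.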

III. DTMC based Modelling on BFA Decoder 

The effect of the input buffer has been investigated for 
iterative decoders such as the Turbo decoder [8] and the LDPC 
decoder [9]-[11]. The non-deterministic decoding time of 
the BFA is similar to that of Turbo decoding and LDPC 
decoding. A modelling strategy similar to that introduced in 
[11] is used to analyze the BFA decoder with an input buffer. 

The relationship between the input data rate ($R_d$), the input 
buffer size ($B$) and the clock speed of the decoder ($f_{clk}$) can 
be found via simulation. Another way to analyze the system 
is to model it based on queuing theory. The BFA decoder with 
an input buffer can be treated as a D/G/1/B queue, in which 
D means that the input data rate is deterministic, G means 
that the decoding time follows a general distribution, 1 means 
that there is one decoder and B is the number of codewords the 
input buffer can hold. The state of the BFA decoder is 
represented by the input buffer occupancy ($O$) when a codeword 
is decoded, which is measured in terms of branches stored in the 
buffer. $O(n)$ and $O(n+1)$ have the following relationship: 

$$O(n+1) = O(n) + [T_s(n) \cdot R_d - L_f], \qquad (5)$$

where $O(n+1)$ is the input buffer occupancy when the $n$th 
codeword is decoded, $T_s(n)$ is the decoding time of the $n$th 
codeword by the BFA decoder, $L_f$ is the length of a 
codeword in terms of branches, and $[x]$ denotes rounding $x$ to 
the nearest integer. The speed factor of the BFA decoder 
is defined as the ratio between $f_{clk}$ and $R_d$ [1]: 

$$\mu = \frac{f_{clk}}{R_d}. \qquad (6)$$

If $f_{clk}$ is normalized to 1, Eq. (5) can be changed to: 

$$O(n+1) = O(n) + \left[\frac{T_s(n)}{\mu} - L_f\right]. \qquad (7)$$
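
The recursion in Eq. (7) can be iterated directly as a quick Monte Carlo check. The sketch below is a simplification under stated assumptions rather than the simulator used for the results in this paper: decoding times are drawn from an ideal Pareto tail with an assumed exponent `beta`, the occupancy is clamped between 0 and $B \cdot L_f$ branches, and every overflow is counted as one erased codeword. All function and parameter names are hypothetical.

```python
import random


def sample_decoding_time(rng, l_f, beta):
    """Draw T_s (clock cycles) with tail Prob{T_s > T} = (T / l_f) ** -beta."""
    u = 1.0 - rng.random()               # u lies in (0, 1]
    return l_f * u ** (-1.0 / beta)


def simulate_buffer(mu, buffer_size, l_f=206, beta=1.5, n_codewords=20_000, seed=1):
    """Iterate Eq. (7) and return the fraction of codewords that overflow."""
    rng = random.Random(seed)
    capacity = buffer_size * l_f         # buffer capacity in branches
    occupancy = 0
    erased = 0
    for _ in range(n_codewords):
        t_s = sample_decoding_time(rng, l_f, beta)
        delta = round(t_s / mu - l_f)    # branches arrived minus branches removed
        occupancy += delta
        if occupancy > capacity:         # buffer full: count one erasure
            erased += 1
            occupancy = capacity
        elif occupancy < 0:              # buffer empty: decoder waits for data
            occupancy = 0
    return erased / n_codewords
```

For a fixed seed the same decoding times are reused, so raising `mu` can only lower the returned erasure fraction, which mirrors the role of the speed factor in the analysis.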



The state of the input buffer at time $n+1$ is decided only by 
the state at time $n$ and the decoding time $T_s(n)$. At the same 
time, $T_s(n)$ and $T_s(n+1)$ are i.i.d. As a result, the state of the 
input buffer is a Discrete Time Markov Chain (DTMC). $T_s(n)$ 
follows the Pareto distribution for the BFA decoding and is in 
units of clock cycles per codeword. The following equation can 
be used to describe the Pareto distribution: 

$$\mathrm{Prob}\{T_s > T\} \approx A \cdot \left(\frac{T}{T_{min}}\right)^{-\beta}, \qquad (8)$$

where $T_{min}$ is the minimum decoding time, which is $L_f$ 
clock cycles in the considered model. The Pareto exponent 
$\beta$ is a function of the SNR and the code rate. Fig. 2 shows 
the simulated and approximated (based on Eq. (8)) Pareto 
distributions for both the UFA and the BFA at $E_b/N_0$ = 4 dB 
and 5 dB. It can be seen that as the SNR increases, the Pareto 
exponent increases, and for the same SNR the BFA has a 
higher Pareto exponent than the UFA. The simulated 
distribution of $T_s$, which is more accurate than 
the approximation based on Eq. (8), will be used 
in the following analysis. The difference between $O(n+1)$ 
and $O(n)$ is defined as: 



$$\Delta(n) = O(n+1) - O(n) = \left[\frac{T_s(n)}{\mu} - L_f\right]. \qquad (9)$$
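
The distribution of $\Delta$ in Eq. (9) follows directly from a set of decoding-time samples. The sketch below shows one way to tabulate it. The samples here are stand-ins drawn from an ideal Pareto tail with an assumed exponent, since the simulated distribution of Fig. 2 is not reproduced in this text; `delta_pmf` and the constants are hypothetical names and values.

```python
import random
from collections import Counter


def delta_pmf(decoding_times, mu, l_f):
    """Empirical pmf of Delta = [T_s / mu - L_f] (Eq. (9)) from T_s samples."""
    counts = Counter(round(t / mu - l_f) for t in decoding_times)
    n = len(decoding_times)
    return {delta: c / n for delta, c in sorted(counts.items())}


# Stand-in samples: Pareto tail with an assumed exponent (not measured data).
rng = random.Random(0)
L_F, BETA, MU = 206, 1.5, 3.0
samples = [L_F * (1.0 - rng.random()) ** (-1.0 / BETA) for _ in range(50_000)]
pmf = delta_pmf(samples, MU, L_F)
delta_min = min(pmf)                     # smallest observed occupancy change
```

Because no codeword can be decoded in fewer than $L_f$ clock cycles, the smallest possible change is $[L_f/\mu - L_f]$, which is negative whenever $\mu > 1$: the buffer drains when decoding is fast.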



Fig. 3 shows that the total number of states of the input buffer 
with size $B$ is: 

$$\Omega = B \cdot L_f. \qquad (10)$$

Fig. 2. Simulated and approximated Pareto distributions for the UFA and 
the BFA at $E_b/N_0$ = 4 dB and 5 dB. The code rate is $R$ = 1/3. 



The state transition diagram is shown in Fig. 4. As a result, 
the state transition probability matrix of the input buffer is: 

$$P = [P_{ij}]_{n \times n}, \qquad (11)$$

where $P_{ij}$ is the state transition probability from $S_i$ to $S_j$, 
which can be calculated as follows: 

$$P_{ij} = \begin{cases} \sum_{k=\Delta_{min}}^{1-i} p_{\Delta=k}, & j = 1 \\ p_{\Delta=j-i}, & 1 < j < n \\ \sum_{k \ge n-i} p_{\Delta=k}, & j = n \end{cases} \qquad (12)$$

where $p_{\Delta=w} = \mathrm{Prob}(\Delta = w)$ and $\Delta_{min} = \left[\frac{\min(T_s)}{\mu} - L_f\right]$. 
The value of $p_{\Delta=w}$ can be estimated from the simulated 
distribution of $T_s$ as shown in Fig. 2. The initial state probability 
($n$ = 0) of the input buffer is: 

$$\pi(0) = (\pi_1(0), \pi_2(0), \ldots, \pi_n(0)) = (1, 0, \ldots, 0). \qquad (13)$$

The steady state probability of the input buffer is then: 

$$\Pi = \lim_{n \to \infty} \pi(n) = \lim_{n \to \infty} \pi(0) \cdot P^n. \qquad (14)$$
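
A small numerical sketch may help to make the construction of the transition matrix and the limit in Eq. (14) concrete. It assumes that transitions falling below the first state or above the last state are folded into those boundary states, and it approximates the limit by repeated multiplication until the vector stops changing. The toy distribution of $\Delta$ and the five-state buffer are hypothetical values chosen for illustration only.

```python
def transition_matrix(pmf, n_states):
    """Build P where P[i][j] = Prob(next state j | current state i).

    pmf maps an integer occupancy change Delta to its probability.
    Moves past either end of the buffer are folded into the boundary
    states, which is the assumption made in this sketch.
    """
    P = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        for delta, prob in pmf.items():
            j = min(max(i + delta, 0), n_states - 1)
            P[i][j] += prob
    return P


def steady_state(P, tol=1e-12, max_iter=100_000):
    """Iterate pi(n+1) = pi(n) * P from pi(0) = (1, 0, ..., 0) until it settles."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)         # Eq. (13): the buffer starts empty
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Toy pmf (hypothetical values): the buffer usually drains, occasionally fills.
toy_pmf = {-2: 0.6, -1: 0.2, 1: 0.1, 3: 0.1}
P = transition_matrix(toy_pmf, n_states=5)
Pi = steady_state(P)
```

Power iteration is used here because it mirrors Eq. (14) literally. For a buffer with $\Omega = B \cdot L_f$ states, solving the linear system $\Pi P = \Pi$ with a numerical library would likely be faster.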

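The failure probability of the decoder can be calculated by: 

$$P_f = \sum_{i=1}^{n} \Pi(i) \cdot p_i^{+}, \qquad (15)$$

where $p_i^{+} = \mathrm{Prob}(\Delta > \Omega - i)$. The mean buffer occupancy 
can be calculated by: 

$$O_{mean} = \frac{\sum_{i=1}^{n} i \cdot \Pi(i)}{\Omega} \times 100\%. \qquad (16)$$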
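
Given a steady-state vector and the distribution of $\Delta$, the two quantities above reduce to short sums. The sketch below is illustrative only: the steady-state vector and the pmf are made-up numbers for a four-state buffer, and the mean occupancy is computed by weighting each state with its index $i$, which is an assumption of this sketch.

```python
def failure_probability(Pi, pmf):
    """Eq. (15): P_f = sum_i Pi(i) * Prob(Delta > Omega - i), states i = 1..Omega."""
    omega = len(Pi)
    p_f = 0.0
    for i, pi_i in enumerate(Pi, start=1):
        p_overflow = sum(p for delta, p in pmf.items() if delta > omega - i)
        p_f += pi_i * p_overflow
    return p_f


def mean_occupancy_percent(Pi):
    """Mean state index relative to the number of states, in percent.

    Assumption of this sketch: occupancy is weighted by the state index i.
    """
    omega = len(Pi)
    return 100.0 * sum(i * pi_i for i, pi_i in enumerate(Pi, start=1)) / omega


# Hypothetical steady state and Delta pmf for a 4-state buffer.
Pi = [0.7, 0.2, 0.08, 0.02]
pmf = {-1: 0.8, 1: 0.15, 2: 0.05}
p_f = failure_probability(Pi, pmf)
o_mean = mean_occupancy_percent(Pi)
```

In this toy case only the two fullest states can overflow, so the failure probability is dominated by how much steady-state mass sits near the top of the buffer.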




Fig. 4. Illustration of state transition 



IV. Simulation Results 

Firstly, the semi-analytical results calculated by Eq. (15) 
are compared with the simulation results to validate the 
DTMC based modelling. The simulation setup is shown in 
Table I. $E_b/N_0$ = 4 dB was used as an example, at which 
$P_{target} \approx 10^{-3}$. The convolutional code in the simulation was 
the one used in the WirelessHD specification [6]. The input 
buffer size $B$ in the simulation takes the buffer within the BFA 
decoder into account. It can be seen from Fig. 5 that the 
semi-analytical results are quite close to the simulation results for 
both the UFA decoder and the BFA decoder, which indicates that 
the DTMC based modelling is accurate. For the input buffer 
size of $B$ = 10, the working speed factors of the UFA decoder 
and the BFA decoder are about $\mu$ = 14 and $\mu$ = 3.6, respectively, 
so the BFA decoder improves the decoding throughput by about 
290% over the UFA decoder. If the 
input buffer size increases to $B$ = 25, the working speed factors 
become about $\mu$ = 8.7 and $\mu$ = 2.9, respectively, resulting in 
about 200% decoding throughput improvement. As long as 
the distribution of $T_s$ is known, $P_f$ can be easily obtained 
for different values of speed factor and input buffer size. 
When the target $P_f$ is very low (at high SNR), the DTMC based 
modelling saves a great deal of simulation time. 
How to use the DTMC based modelling in designing a high 
throughput parallel BFA decoding system will be shown in the 
next section. 
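
The throughput improvements quoted above follow from the definition of the speed factor: at a fixed clock frequency, $R_d = f_{clk}/\mu$, so the throughput ratio between the two decoders is $\mu_{UFA}/\mu_{BFA}$. A minimal check of the arithmetic, using the approximate speed factors read from the text (the function name is an assumption of this sketch):

```python
def throughput_gain_percent(mu_ufa, mu_bfa):
    """Relative throughput gain of BFA over UFA at equal clock frequency."""
    return (mu_ufa / mu_bfa - 1.0) * 100.0


gain_b10 = throughput_gain_percent(14.0, 3.6)   # speed factors for B = 10
gain_b25 = throughput_gain_percent(8.7, 2.9)    # speed factors for B = 25
```

The two values come out at roughly 289% and 200%, in line with the figures stated in the text.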

The input buffer occupancy distribution for the BFA decoder 
with B=10 at different speed factors is shown in Fig. 6, which 
was obtained from Eq. (14). The mean buffer occupancy in 
percentage calculated by Eq. (16) is shown in Fig. 7. For both 
$B$ = 10 and $B$ = 25, whose working speed factors are about 3.6 



TABLE I 
Simulation setup 

Code rate (R)                     1/3 
Generator polynomials             g0 = 133 (octal), g1 = 171 (octal), g2 = 165 (octal) 
Constraint length (K)             7 
Branch metric calculation         1-bit hard decision with Fano metric 
Threshold adjustment value (δ)    2 
Modulation                        BPSK 
Channel                           AWGN 
Information length (L)            200 bits 
Codeword length (Lf)              L + K - 1 = 206 branches 



Fig. 5. Comparison between semi-analytical and simulation results ($P_f$ vs 
$\mu$) for UFA and BFA at $E_b/N_0$ = 4 dB 



and 2.9, the mean buffer occupancies are about 17% and 25%, 
respectively. The decoding delay for B=25 is slightly higher 
than that for B=10, while the decoding throughput for B=25 
is higher than that for B=10 as shown in Fig. 5. 

V. Input Buffer Size in Parallel BFA Decoding 

Unlike for the Viterbi decoder, it is difficult to use pipelining in 
designing a high throughput BFA decoder due to the irregular 
decoding operations and the variable computational effort. 
Parallel processing is a promising strategy to achieve high 
throughput BFA decoding at the multi-Gbps level. If a single BFA 
decoder cannot achieve the target average decoding throughput 
($T_{target}$), a number of BFA decoders ($N_{decoder}$) need to be 
operated in parallel (as shown in Fig. 8) such that: 

$$T_{target} = N_{decoder} \cdot R_d(B), \qquad (17)$$



where $R_d$ is a function of the input buffer size $B$. The total 
area of the parallel BFA decoders is: 

$$A_{total} = A_{decoder} + A_{buffer} = N_{decoder} \cdot A_{BFA} + N_{decoder} \cdot B \cdot A_B. \qquad (18)$$

If the area ratio between a BFA decoder ($A_{BFA}$) and an input 
buffer which can hold one codeword ($A_B$) is $\alpha = A_{BFA}/A_B$, 
Eq. (17) will become: 

$$T_{target} = \frac{A_{total}}{A_{BFA} + B \cdot A_B} \cdot R_d(B) = \frac{A_{total}}{A_B} \cdot \frac{R_d(B)}{\alpha + B}. \qquad (19)$$

Fig. 6. Buffer occupancy distribution for BFA decoder at $E_b/N_0$ = 4 dB when 
$B$ = 10 

Fig. 7. Mean buffer occupancy for BFA decoder at $E_b/N_0$ = 4 dB when $B$ = 10 
and $B$ = 25 
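
To see how Eq. (19) behaves as the buffer grows, the factor $R_d(B)/(\alpha + B)$ can be evaluated for a handful of buffer sizes and normalized to its maximum. The sketch below does this under explicit assumptions: a 1 GHz clock is assumed, the rates for $B$ = 10 and $B$ = 25 are derived from the speed factors 3.6 and 2.9 reported in Section IV, and the remaining rates are placeholders invented for illustration rather than values taken from the DTMC model.

```python
def normalized_throughput(rate_by_buffer, alpha):
    """Eq. (19) up to the constant A_total / A_B, scaled so the maximum is 1."""
    raw = {b: r / (alpha + b) for b, r in rate_by_buffer.items()}
    peak = max(raw.values())
    return {b: v / peak for b, v in raw.items()}


# R_d(B) in Mbps. The B = 10 and B = 25 entries follow from mu = 3.6 and 2.9
# at an assumed 1 GHz clock; the other entries are placeholders, not model output.
RATE_MBPS = {5: 180.0, 10: 278.0, 15: 310.0, 20: 330.0, 25: 345.0}

curve = normalized_throughput(RATE_MBPS, alpha=16)
best_b = max(curve, key=curve.get)
```

The shape is what matters here: $R_d(B)$ grows with diminishing returns while the area per decoder grows linearly in $B$, so the normalized throughput peaks at an intermediate buffer size.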



It can be seen from Eq. (19) that for a fixed $A_{total}$ and $A_B$, 
the decoding throughput of parallel BFA decoders changes 
with the input buffer size $B$. The relationship 
between the input data rate $R_d$ and the input buffer size $B$ is 
shown in Fig. 9, which was obtained by the DTMC based 
modelling introduced in Section III. The clock speed of the 
BFA decoder is assumed to be $f_{clk}$ = 1 GHz. The throughput 
normalized to its maximum for 
different $\alpha$ values is shown in Fig. 10. The value of $\alpha$ 
depends on the technology used in the hardware implementation. 
It can be seen from Fig. 10 that there is an optimal choice of 
the input buffer size $B$ that maximizes the decoding throughput 
for a fixed area constraint. For example, if $\alpha$ = 16, the optimal 
input buffer size is 10. Equivalently, for 
a target decoding throughput, the optimal 
choice of the input buffer size minimizes the hardware 
area, as the following example illustrates. 

Fig. 8. Number of decoders vs input buffer size in parallel BFA decoding 

Fig. 9. Data rate vs input buffer size for BFA at $E_b/N_0$ = 4 dB 

■ Example 

If the target decoding throughput is $T_{target}$ = 1 Gbps and two 
input buffer sizes $B_1$ = 5 and $B_2$ = 10 are used, according to 
Eq. (17) and Fig. 9, the numbers of parallel BFA decoders 
required are: 

$$N_1 = 6 \quad \text{and} \quad N_2 = 4. \qquad (20)$$
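
Since the number of decoders must be an integer, Eq. (17) amounts to rounding $T_{target}/R_d(B)$ up. The sketch below reproduces the two counts under stated assumptions: the rate for $B_2$ = 10 is taken as about 278 Mbps (from $\mu$ = 3.6 at 1 GHz), while the rate for $B_1$ = 5 is a placeholder of 180 Mbps, since any value from about 167 Mbps up to just under 200 Mbps yields six decoders.

```python
import math


def decoders_needed(target_mbps, rate_mbps):
    """Smallest integer N with N * R_d >= T_target (Eq. (17))."""
    return math.ceil(target_mbps / rate_mbps)


n1 = decoders_needed(1000.0, 180.0)   # B1 = 5, placeholder rate
n2 = decoders_needed(1000.0, 278.0)   # B2 = 10, rate implied by mu = 3.6
```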

When $B_1$ = 5 is used, the total area of the parallel BFA decoders 
will be: 

$$A_1 = N_1 \cdot A_{BFA} + N_1 \cdot B_1 \cdot A_B. \qquad (21)$$

When $B_2$ = 10 is used, the total area of the parallel BFA 
decoders will be: 

$$A_2 = N_2 \cdot A_{BFA} + N_2 \cdot B_2 \cdot A_B. \qquad (22)$$



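If $\alpha$ = 16, the area reduction by using $B_2$ = 10 compared to $B_1$ = 5 
will be: 

$$V = \left(\frac{A_1}{A_2} - 1\right) \times 100\% = \left(\frac{N_1}{N_2} \cdot \frac{\alpha + B_1}{\alpha + B_2} - 1\right) \times 100\% \approx 20\%. \qquad (23)$$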
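
The numbers in Eq. (23) can be checked by expressing both areas in units of $A_B$, so that each design costs $N \cdot (\alpha + B)$. The short sketch below does this with the values from the example. The helper name is hypothetical, and the second percentage is an additional way of reading the same comparison, not a figure from the paper.

```python
def relative_area(n_decoders, buffer_size, alpha):
    """Total area in units of A_B: N * (alpha + B), from Eqs. (21)-(22)."""
    return n_decoders * (alpha + buffer_size)


a1 = relative_area(6, 5, alpha=16)        # 6 * 21 = 126
a2 = relative_area(4, 10, alpha=16)       # 4 * 26 = 104
ratio_percent = (a1 / a2 - 1.0) * 100.0   # Eq. (23)
saving_percent = (1.0 - a2 / a1) * 100.0  # area saved relative to the B1 = 5 design
```

Eq. (23) evaluates to about 21%, which the text rounds to 20%. Measured against the larger design $A_1$ instead, the same comparison corresponds to a saving of roughly 17%.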

Fig. 10. Normalized throughput vs input buffer size for different $\alpha$ values 
at $E_b/N_0$ = 4 dB 

VI. Conclusion 

In this paper, a BFA decoder with an input buffer was analyzed 
from the queuing theory perspective. The decoding system 
was modelled by a Discrete Time Markov Chain and the 
relationship between the input data rate, the input buffer size 
and the clock speed of the decoder was established. The 
working speed factor of the BFA decoder at each SNR can be 
easily found by the DTMC based modelling, which can also be 
used in designing a high throughput parallel BFA decoding 
system. The trade-off between the number of 
BFA decoders and the input buffer size in such a system was 
discussed as well. It was shown that an optimal input buffer size 
can be found for a target decoding throughput under a fixed 
hardware area constraint. 